ABSTRACT


Records  of  26  months  of  operating,  repairing,  and  maintaining  the 
nuclear  accelerator  Superhilac  at  the  Lawrence  Berkeley  Laboratory 
(LBL)  are  analyzed  with  respect  to  system  availability  and  reliability. 

A major  portion  of  the  report  is  devoted  to  building  a suitable  model 
for the availability analysis. Some specific recommendations for
improvement are also given. The current availabilities for operating the
machine in Modes 1, 2, and 3 are 64.8%, 76.9%, and 80% respectively.

The Adam injector is most responsible for causing the current low
availability level of Mode 1. An increase of 15 hours on top of the
current MTBF of 22.74 hours for the Adam injector (or a decrease of
about 2 hours from its current MTTR of 5.15 hours) would result in an
overall boost of 5.6% for Mode 1 availability. (See Chapter 3 and
Appendix 2 for additional recommendations.) Either way, this would
translate into about 6 more usable hours every 4 days of operations
for Mode 1.

Some optimization schemes (when the costs associated with improving
the various pieces of equipment are known) are indicated for obtaining
specific adjustments to be made systemwide in order to achieve
economically an assigned higher availability level. Such models and
schemes can easily be adapted to future needs as the information
gathering system evolves in the near future into an organic computer
network capable of collecting and analyzing much more voluminous and
accurate data.

Also  included  are  system  reliability  and  miscellaneous  findings  from 
Time  Series  Analysis  and  Total  Time  on  Test  Plots. 
CHAPTER 1


INTRODUCTION 


1.1  The  System 

The system to be studied is the Superhilac nuclear accelerator at
the Lawrence Berkeley Laboratory, a diagram of which is shown in Figure 1.
As shown in the figure, it consists of 14 different categories of
subsystems. This comes from the way the failure data are recorded and
classified. Notice that some subsystems are more or less well-defined
physical blocks, like the Adam and Eve injectors, while others are spread
over some or all portions of the entire accelerator. Subsystem 14
(building power) is a good example of such a subsystem. Needless to say,
each subsystem so defined is a very sophisticated piece of equipment.
Also, despite its physical appearance, the Superhilac is a series system,
since failure of any one of the 14 subsystems causes the system to go
down. As can be seen, the entire journey of the ion beams starts from the
injection area, where various types of ions are prepared and injected
according to the needs of experiments. Entering the second region,
composed of pre-stripper, stripper, and post-stripper, the particles gain
more and more momentum as they are stripped of their orbiting electrons
in the carefully set and tuned electrical fields. Bending and turning as
directed by the steering magnets, they finally emerge at the exit, where
they are either delivered directly to the experimenters requesting them
or are transported through the transfer line to be further accelerated by
the Bevatron.

Another account of the system from the point of view of mathematical
model building is given in Chapter 3. Here, for our purposes, it suffices
to mention that

(i)   basically it is a series system;

(ii)  when the Adam injector is used in the operation, it is called
      Mode 1; when Eve is used it is called Mode 2; finally, when
      a fraction of the ion beams of either Mode 1 or Mode 2 is
      taken for a third group, it is called Mode 3, alias "parasitic";

(iii) Mode 1 and Mode 2 ion beams can be accelerated independently
      at the same time through the same structure because of the
      time-sharing facility.

1.2  The  Data 

The data were compiled by Lee Besse [9] from log books kept over
the period from January 1974 through February 1976. As the system fails,
the failure is traced, according to the best knowledge of the repairmen,
to the villain subsystem. The operating mode, the occurrence time, the
number of repair hours, the up time so far recorded for the system, as
well as the name of the subsystem, among other things, are recorded.

As mentioned before, there are 14 different subsystems that can cause
system failure.

For our purpose, the quantities of interest are the up times and
down times of the system and subsystems. The way these are computed
from the data book is explained in Appendix 1.

1.3  The  Need  for  the  Study  and  the  Goal 

LBL is a world famous laboratory. Groups of scientists come not
only from all over the United States, but also from many parts of the
world, to do experiments. As can be imagined, not everyone can get
experimental time by merely submitting a proposal. There exists a Program
Advisory Committee, formed of leading nuclear scientists, which oversees
the utilization of the machine by examining the proposals and assigning
time based on the scientific worth of each proposal. An occasional
interruption of a few hours can of course be tolerated, but longer or
more frequent interruptions are a great inconvenience to the group(s)
involved. The potential use of very high energy heavy ions as a tool
for cancer treatment also underlines the importance of higher
availability. Consequently, systematic efforts are now being made to
improve availability and reliability. This study is a part of such
efforts.

Roughly, the current system availabilities are about 70%. That
is to say, in the long run, there are only 70 usable hours per 100 hours
of operation, or a little more than a day is lost every 4 days. It is
the opinion of LBL people that a desirable level should be in the
neighborhood of 95%.

The plan of development of this report is to introduce some
preliminaries in Chapter 2 and then build models for our system, do
analyses, and spell out some specific recommendations in Chapter 3.
Reliability is briefly treated in Chapter 4. We end this chapter by
listing the notation used in this study.


1.4 Notation

MTBF:
    Mean time between failures, also denoted by $\mu$ for the system
    or by $\mu_i$ for the ith subsystem. Most of the time, $\mu$ and
    the $\mu_i$ are estimated from the operating data by simple
    averaging. In this case, they are denoted by $\hat{\mu}$ or
    $\hat{\mu}_i$ respectively.

MTTR:
    Mean time to repair, denoted by $\nu$, $\nu_i$, or $\hat{\nu}$
    as the case may be.

$A$:
    System availability in general is the limiting fraction of time
    that a system is up. It is estimated by

    $$\hat{A} = \frac{\sum_j u_j}{\sum_j u_j + \sum_j d_j},$$

    where $u_j$ and $d_j$ are the jth up times and down times in a
    fairly long period of operation of the system.

$A_{2\cdot\cdot}$:
    System availability of a series system of 2 components which are
    functionally independent, i.e., the failure of one will not shut
    off the other.

$A_{n\cdot\cdot}$:
    A generalization of $A_{2\cdot\cdot}$ to the case of a series
    system of n components, each of which is functionally independent
    of all the others. We will sometimes write $A_{\cdot\cdot}$ for
    $A_{n\cdot\cdot}$ when no danger of confusion exists.

$A_{2\leftrightarrow}$:
    Like $A_{2\cdot\cdot}$, but now failure of either one turns off
    the other, thus suspending the useful life of the non-failed
    component during the repair of the failed component.

$R$:
    System reliability is the probability that a system will perform
    satisfactorily for an assigned length of time or longer under
    stated conditions.

Assumption:
    The random lifetimes of all the subsystems (or components) for
    any model in this report are assumed statistically independent
    of each other. This is not to be confused with the functional
    independence stated above.

Convention:
    All % values are absolute, not relative, unless otherwise stated.
    That is to say, if one system achieves 78% availability and
    another 80%, then we say they differ by 2%. But since typically
    our availability will be about 70% or higher this is not a
    serious distinction.
CHAPTER 2

TWO-COMPONENT SERIES SYSTEM AVAILABILITY -
TWO PLEASANT PROPERTIES


Many features pertaining to 2-component series systems readily
carry over to n-component series systems. We presented the notation
$A_{2\cdot\cdot}$ and $A_{2\leftrightarrow}$ in the last chapter. The
formulas for computing $A_{2\cdot\cdot}$ and $A_{2\leftrightarrow}$ in
terms of component parameters are:

$$A_{2\cdot\cdot} = \frac{1}{1 + \frac{\nu_1}{\mu_1} + \frac{\nu_2}{\mu_2} + \frac{\nu_1 \nu_2}{\mu_1 \mu_2}} \quad \text{(cf. [4])},$$

$$A_{2\leftrightarrow} = \frac{1}{1 + \frac{\nu_1}{\mu_1} + \frac{\nu_2}{\mu_2}} \quad \text{(cf. [6])}.$$

Algebraically, $A_{2\cdot\cdot}$ is less than or equal to
$A_{2\leftrightarrow}$. This is also intuitively clear from the
definition of $A_{2\cdot\cdot}$, wherein one component may be operating
unnecessarily while the other is being repaired. As the notation
suggests, an intermediate case is easily imagined to be
$A_{2\rightarrow}$, for which the setup is such that the failure of one
component shuts off the other, but not vice versa. Clearly,
$A_{2\cdot\cdot} \le A_{2\rightarrow} \le A_{2\leftrightarrow}$, as
$A_{2\rightarrow}$ saves lifetime relative to $A_{2\cdot\cdot}$ but
wastes lifetime with respect to $A_{2\leftrightarrow}$. When n = 3, the
formulas for $A_{3\rightarrow}$ and other intermediate cases have been
derived in [8], under exponential assumptions.

Unfortunately, as the number of components increases, the number of
cases grows enormously. The derivation of the corresponding
availabilities becomes very tedious, though not impossible, under
exponentiality assumptions. However, although exact formulas are useful
in some applications, as when very high accuracy is required, the
following example shows that we are very fortunate in that
$A_{2\cdot\cdot}$ and $A_{2\leftrightarrow}$ are so close together for
medium high and high availability systems. Hence, for our purposes it
is not necessary to calculate the exact value by $A_{2\rightarrow}$:

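Example 1:

$\nu_1 = 1$ hr.,   $\mu_1 = 10$ hrs.
$\nu_2 = 3$ hrs.,  $\mu_2 = 20$ hrs.

We note that such values are similar to some of those of our real
system.

$$A_{2\cdot\cdot} = \frac{1}{1 + \frac{1}{10} + \frac{3}{20} + \frac{1 \cdot 3}{10 \cdot 20}} = \frac{1}{1 + 0.1 + 0.15 + 0.015} = \frac{1}{1.265} = 0.7905,$$

$$A_{2\leftrightarrow} = \frac{1}{1 + \frac{1}{10} + \frac{3}{20}} = \frac{1}{1 + 0.1 + 0.15} = \frac{1}{1.25} = 0.8 .$$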
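The arithmetic of Example 1 can be checked with a few lines of Python. This is only an illustrative sketch: the function names `avail_independent` and `avail_shutoff` are chosen here and do not appear in the report, and the n-component forms are assumed to be the direct generalizations of the two-component formulas.

```python
from math import prod

def avail_independent(mtbf, mttr):
    """Series availability when no component shuts off another:
    product of mu_i / (mu_i + nu_i) over all components."""
    return prod(m / (m + r) for m, r in zip(mtbf, mttr))

def avail_shutoff(mtbf, mttr):
    """Series availability when any failure shuts off every other
    component: 1 / (1 + sum of nu_i / mu_i)."""
    return 1.0 / (1.0 + sum(r / m for m, r in zip(mtbf, mttr)))

# Parameters of Example 1 (hours).
mtbf = [10.0, 20.0]
mttr = [1.0, 3.0]

lower = avail_independent(mtbf, mttr)
upper = avail_shutoff(mtbf, mttr)
print(f"independent: {lower:.4f}   mutual shut-off: {upper:.4f}")
```

The two values bracket any intermediate shut-off arrangement, which is why the gap between them is the quantity of interest.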


Notice that the two denominators differ from each other by a second
order term, which accounts for their closeness. We easily see that in
n-component cases the situation is similar. Therefore, errors committed
in estimating $A_{\rightarrow}$ by $A_{\leftrightarrow}$ are usually
relatively small. So, in practical applications, we are not interested
in using the exact formulas, since in the real world other types of
errors are likely to arise, e.g., model errors, data-collecting-handling
errors, and other human errors.
With more fully-computerized data gathering-analyzing systems, the
last two kinds of errors can be expected to diminish, leaving the model
errors the only ones to worry about if they exist. However, since the
entire process of analyzing the Superhilac data involves a substantial
amount of human effort, involving many people at many stages, we would
like to see how sensitive the system availability is with respect to
the errors in calculating the parameters. It turns out to be fairly
insensitive, as shown in the next example:

Example 2:

Refer to Example 1. Assume that a 10% error has occurred in the
estimation of all 4 parameters, all in the direction favoring A.
That is, we now have $\nu_1' = 0.9$, $\nu_2' = 2.7$, $\mu_1' = 11$,
$\mu_2' = 22$.

$$A_{2\cdot\cdot}' = 0.8233 \;\Rightarrow\; A_{2\cdot\cdot}' - A_{2\cdot\cdot} = 3.3\%$$

$$A_{2\leftrightarrow}' = 0.8302 \;\Rightarrow\; A_{2\leftrightarrow}' - A_{2\leftrightarrow} = 3.02\% .$$

Indeed, A is very robust, since in general a 10% error is considered
rather high, and cancellations between errors are likely to occur.
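The sensitivity figures of Example 2 can be reproduced as follows. The sketch below is illustrative only; the helper names `a_independent` and `a_shutoff` are hypothetical labels for the two-component formulas of this chapter.

```python
def a_independent(mu1, nu1, mu2, nu2):
    """Two-component series availability, no shutting off."""
    return 1.0 / ((1 + nu1 / mu1) * (1 + nu2 / mu2))

def a_shutoff(mu1, nu1, mu2, nu2):
    """Two-component series availability, each failure shuts off the other."""
    return 1.0 / (1 + nu1 / mu1 + nu2 / mu2)

base = dict(mu1=10.0, nu1=1.0, mu2=20.0, nu2=3.0)
# A 10% error in every parameter, each in the direction that raises A.
perturbed = dict(mu1=11.0, nu1=0.9, mu2=22.0, nu2=2.7)

shift_independent = a_independent(**perturbed) - a_independent(**base)
shift_shutoff = a_shutoff(**perturbed) - a_shutoff(**base)
print(f"{shift_independent:.4f} {shift_shutoff:.4f}")
```

Both shifts come out at roughly three percentage points, even though all four errors were pushed in the same direction.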

To guard against model errors, needless to say, one has to check
that the essence of the real system is captured by the model built for
it. This is no simple task since the real world is always complicated.
The final judgement depends on the performance of the model versus actual
observations.

To sum up, we have two pleasant properties. First, in the ranges
of concern, $A_{2\cdot\cdot}$ and $A_{2\leftrightarrow}$ are very close.
Hence they suffice for calculating cases in between these two extremes.
Secondly, they are fairly insensitive to errors in parameter estimation.
We end the chapter by recording here

$$A_{n\cdot\cdot} = \prod_{i=1}^{n} \frac{\mu_i}{\mu_i + \nu_i}, \qquad A_{n\leftrightarrow} = \frac{1}{1 + \sum_{i=1}^{n} \frac{\nu_i}{\mu_i}}$$

(again for any set of mutually independent random variables).
CHAPTER 3

SYSTEM  AVAILABILITY  ANALYSIS 


3.1  Model  Building 

In the last chapter we considered some theoretical models. We
now turn to the real system. As briefly described in Chapter 1, from
the way the data were recorded and the way the subsystems were defined,
it is only natural to regard the system basically as a series system*
consisting of 13 subsystems. However, certain complications do exist.
Basically, this is due to the fact that there are 3 modes in which the
accelerator is used for accelerating different types of ions at different
energy levels and for different experiments. For input, Mode 1 uses only
the Adam injector to inject pertinent ions to the pre-stripper. Mode 2
uses only the Eve injector for its purposes, while Mode 3, rightly termed
parasitic, utilizes either the Adam or Eve injector for its own
experiments, although the ion types from the two injectors need not
always be interchangeable for a given experiment requested by a group of
experimenters in Mode 3. Now, the rest of the subsystems in Mode 1 are
all shared by Mode 2, and also by Mode 3, of course. (See Table 1.) This
fact alone makes an elegant treatment difficult: Do we regard the entire
Superhilac as a whole consisting of all modes, or do we have 3 separate
systems corresponding to the 3 modes? Either way we suffer from inherent
difficulties.

But in the end we compromise by considering the three modes separately
in our model building effort, basically because it lends itself readily

* Subsystem 13, the experimenter, has been excluded from all the models
we built, since the system down time due to it has been restored to up
time, for the system really has not failed. Moreover, the subsystem is
not important since MTBF = 200 hrs. and MTTR = 1.5 hrs.
TABLE 1

Subsystem   Mode 1                         Mode 2

    1       Adam Injector                  -
            (Including Source Change)
    2       -                              Eve Injector
                                           (Including Source Change)
    3       Radio Frequency                Same
    4       Magnet Power Supply            Same
    5       Cooling                        Same
    6       Vacuum                         Same
    7       Miscellaneous Mechanical       Same
    8       Other                          Same
    9       Computer Hardware              Same
   10       Analog and Digital Hardware    Same
   11       Computer Software              Same
   12       Instrumentation                Same
   13       Experimenter (Deleted)         Same
   14       Building Power                 Same
to simpler mathematical treatment, and because each of the three modes
is peculiar to a set of experiments of its own; thus, separate study is
of interest. We note here the fundamental shortcoming in doing so.
Although the system considering Mode 1 by itself is a series of 12
subsystems (1, 3-12, and 14), it is not completely isolated, in that at
other times (sometimes at the same time) 11 of its subsystems are used by
other modes as well. This makes it very hard to divide the loads
(affecting the calculation and interpretation of the $\mu_i$'s and
$\nu_i$'s) among the modes.

It is beyond the authors' knowledge to go into a further detailed account
of what the system is, how it operates, and how the subsystems are
classified. Instead, we shall present the results of our analysis, and
judge the models by their ability to predict. Let us briefly summarize
the models at hand and the shutting-off relationships between subsystems
[Appendix 1].


Mode 1:  A 12-subsystem series system.

         1 --- 3 thru 12 --- 14

Mode 2:  A 12-subsystem series system (11 of them are common with
         Mode 1).

         2 --- 3 thru 12 --- 14

Mode 3:  A 13-subsystem parallel-series system.

         1 --+
             +--- 3 thru 12 --- 14
         2 --+
3.2 Results of Analysis

The individual down times are easy to take from the data book. The
individual up times are taken in the manner described in Appendix 1.
Basically, the planned maintenance times are completely excluded from the
calculation of up times and down times. Two up times separated solely by
a planned maintenance time and/or a period of no operation are pieced
together as a single up time. Similarly, two down times so separated are
added up to form one down time. Likewise, two pieces of component up time
interrupted by a period of suspension due to the failure of a master
component shutting it off are also lumped together. Table 2 is based on
the data from January 1974 to February 1976. Notice that the parameters
estimated under the 2 modes, where overlapping exists, usually agree with
each other quite well. This is fairly strong evidence that the same
subsystems 3-12 and 14 are used by both modes. The operating data kept
for Mode 3 did not distinguish which mode it was riding on when both
Mode 1 and Mode 2 were up. This prevents the calculation of the Mode 3
subsystem parameter values. Its system MTBF and MTTR, however, are
calculated to be 6.40 and 1.60 hrs., respectively.
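As a quick consistency check, the Mode 3 figures can be turned into an availability with the usual ratio MTBF / (MTBF + MTTR). The sketch below assumes that this ratio is an adequate summary of the up-time and down-time averages; the function name `availability` is chosen here for illustration.

```python
def availability(mtbf, mttr):
    """Long-run fraction of time up, from mean up time and mean down time."""
    return mtbf / (mtbf + mttr)

# System-level values quoted for Mode 3 (hours).
mode3 = availability(6.40, 1.60)
print(f"Mode 3 availability: {mode3:.3f}")
```

The result agrees with the 80% availability quoted for Mode 3.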

In Table 3 we recorded the availabilities obtained from system MTBF's
and MTTR's, as well as from the formulas $A_{12\cdot\cdot}$ and
$A_{12\leftrightarrow}$ (but written as $A_{\cdot\cdot}$ and
$A_{\leftrightarrow}$ for simplicity).*,** For example, using
$A_{\cdot\cdot}$ for Mode 1, from the data of Table 2, we get:

    .815 × .965 × .986 × 1.000 × .989 × .995 × .933
    × .984 × .999 × 1 × .998 × .998 = .698 .

*  Computed by using an approximation formula and estimating the
   parameters from Mode 1 and 2 by simple averages. [The formula itself
   is illegible in the source.] Keep in mind that the ions of Mode 1 and
   Mode 2 are not all interchangeable for Mode 3. This should account for
   part of the discrepancy.

** $A_{\leftrightarrow}$ is calculated similarly, with the second factor
   in $A_{\cdot\cdot}$ changed. [The replacement expression, taken over
   $i = 3, \ldots, 14$, $i \ne 13$, is illegible in the source.]
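The Mode 1 product quoted above is easy to recompute. In this sketch the twelve factors are taken exactly as printed; the assumption that they are listed in the order of subsystems 1, 3, 4, ..., 12, 14 is an inference and is noted in the code.

```python
from math import prod

# Subsystem availabilities for Mode 1 as printed in the text. They are
# assumed here to be in the order of subsystems 1, 3, 4, ..., 12, 14
# (subsystems 2 and 13 are not part of the Mode 1 model).
mode1_factors = [.815, .965, .986, 1.000, .989, .995,
                 .933, .984, .999, 1.000, .998, .998]

mode1_availability = prod(mode1_factors)
print(f"{mode1_availability:.3f}")
```

Multiplying the rounded three-digit factors gives a value within about .001 of the printed .698; the small gap is presumably due to the rounding of the individual factors.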

If the models (together with the shutting-off relationships defined
in Appendix 1 for the real systems) were correct, the actual system
availabilities would fall somewhere in the intervals formed by the upper
and lower bounds given by the models. But since they systematically
fall below the lower bounds, the models (together with the shutting-off
relationships for the real systems) are not quite correct. But in view
of the extremely complex nature of the Superhilac, and in view of the
following reasons, the difference between the actual A and
$A_{\cdot\cdot}$ of less than 5% is rather remarkable.

Remarks:

1.  The Superhilac has been undergoing many changes, large and small,
    of various kinds [1, 2, 12, and 13]. Hence we are not really
    looking at the same system over the 26 months.

2.  Data errors are present.

3.  The actual shutting-off relationships among subsystems are not
    completely known, and change with time.

4.  Coupling of Type 1: We repeat that, for example, 3-12 and 14
    are used by Mode 1, 2, and 3. Each system is not fully isolated
    as assumed in the models.

5.  Coupling of Type 2: The lines of demarcation between the present
    subsystems are not well-defined. That is, there might exist
    overlapping between subsystems. This could invalidate the
    assumed statistical independence required by the models.

6.  Whereas the more or less regular, 2-or-3-times-a-month
    maintenance probably helps a great deal, this intervention also
    causes the system to depart from model behavior. It is the
    intuitive feeling of people close to the system operation that
    right after the maintenance period the machine is more prone
    to failures.

7.  Some limited simulation work [17] has shown that a 75% difference
    between the estimated system availability and the one given by
    an appropriate formula in terms of subsystem parameters for
    runs of a comparable length may easily be due to chance.

8.  Others, known or unknown.

Further Consideration of the Above Remarks:

1.  Remark 1 may be unimportant according to some people conversant
    with the system. It is a fair statement that the system by
    and large stays the same.

2.  Remark 2 may be unimportant by a pleasant property of
    availability calculations considered in Chapter 2.

3.  By the other pleasant property, Remark 3 could be unimportant
    if the true data do not differ substantially from the recorded
    data. It could become important if they differ drastically,
    since this seriously affects the estimation of the $\mu_i$'s.
    As a matter of fact, we have experimented with the assumption
    that each component shuts off the others upon failure. Then
    the $\hat{\mu}_i$'s so obtained are so reduced that we obtained
    .654, .771, and .800 for Mode 1, 2, and 3 in that order from the
    formula predictions (of course using $A_{\leftrightarrow}$ to
    correspond to this situation). What a remarkable agreement with
    .648, .769, and .800! Naturally, this does not prove that the
    actual shutting-off relationships are that each component shuts
    off everything else. But it underscores the importance of
    understanding the system well, for it might seriously affect
    parameter estimation.

4. & 5.  Remark 4 is true and Remark 5 is likely to be true. They are
    likely able to account for a large portion of the discrepancies
    if the true reason does not lie in Remark 3. In the future,
    much more comprehensive automated record-keeping systems,
    which are now being planned, can be expected to easily
    overcome 5. Remark 4 will remain to some extent a problem, unless
    more sophisticated models that can capture this aspect are
    discovered.

6.  Remark 6 is likely true, but perhaps unimportant.

7.  Remark 7 is possible, but deemed unlikely.

8.  Remark 8 is likely, but perhaps unimportant.
To sum up, the more serious errors are the couplings of Type 1 and
2, and the parameter estimation problem stemming from incomplete knowledge
of the true shutting-off relationships. Overall, the formula
$A_{\cdot\cdot}$ predicts fairly closely. This will be the model on which
our recommendations are based.
3.3 Some Specific Recommendations for Improvement

Before presenting tables, we have the following points to make.

1.  For Mode 1, subsystems 1, 8, and 3 (in descending order of
    importance) are most responsible for the current low level
    of $A_{\cdot\cdot}$ = .698. (Subtracting .05 gives the actual
    system availability, .648; it is believed that an increase in the
    value given by the formula of $A_{\cdot\cdot}$ would result in
    about the same amount of increase in the actual A.) This is
    obvious from how the number .698 is arrived at (repeated here for
    clarity, with the factors for subsystems 1, 3, and 8 marked):

     (1)    (3)                                  (8)
    .815 × .965 × .986 × 1.000 × .989 × .995 × .933
    × .984 × .999 × 1 × .998 × .998 = .698 .

    Notice

    .815                 = .815
    .815 × .933          = .760
    .815 × .933 × .965   = .734 .

    Thus, it would be fruitless to improve the others now unless
    it is disproportionately cheap to do so. More formally, the
    following expressions are self-explanatory.
    [Two expressions in partial derivatives of A appear here in the
    original; only their values, 48.87 and 1.78, are legible.]


    Among the three villains 1, 8, and 3, if improvements are
    restricted to MTBF's, then initially 1 is about 50 times as
    important as 3. If improvements are restricted to MTTR's, then
    initially 1 is about twice as important as 3.

2.  Improving the Adam injector, assuming no coupling of Type 2,
    will improve the availability of Mode 1 only, while improving
    the other subsystems 3-12 and 14 will increase the availability
    of Mode 2 as well, due to the coupling of Type 1.

3.  Likewise, for Mode 2, the top three culprits are subsystems
    8, 2, and 3.

4.  Notice both 8 and 3 are mentioned twice. They are definitely
    worthy of improvement. Subsystem 8 is "others." This
    underlines the need for further refining this category. [See
    Section 3.5 of this chapter.] Subsystem 3 is radio frequency; as
    mentioned before, both modes would benefit from its improvement.

It is believed that an increase in the value given by the formula of
$A_{\cdot\cdot}$ would result in about the same amount of increase in
actual A. We caution that the improvement in $A_{\cdot\cdot}$ is less
than additive when varying the MTBF and MTTR simultaneously in the same
subsystem. However, the result is more than additive if we improve 2 or
more subsystems at the same time.

For other tables of this sort, see Appendix 2.
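The two cautions about additivity can be illustrated numerically under the product formula. The sketch below uses the Adam injector figures quoted in this report (MTBF 22.74 hrs., MTTR 5.15 hrs., formula value .698 for Mode 1); the improved availability of .985 for subsystem 3 is a purely hypothetical number chosen for illustration, as are all function and variable names.

```python
def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

A_SYSTEM = 0.698                       # Mode 1 formula value from the text
a1 = availability(22.74, 5.15)         # Adam injector, about .815
a3_now, a3_better = 0.965, 0.985       # .985 is a hypothetical improvement
rest = A_SYSTEM / (a1 * a3_now)        # all remaining factors, held fixed

def system(a_adam, a_rf):
    """Product-form system availability with only two factors varied."""
    return a_adam * a_rf * rest

# Same subsystem: raise MTBF by 15 hrs., cut MTTR by 2 hrs., or do both.
gain_mtbf = system(availability(37.74, 5.15), a3_now) - A_SYSTEM
gain_mttr = system(availability(22.74, 3.15), a3_now) - A_SYSTEM
gain_both = system(availability(37.74, 3.15), a3_now) - A_SYSTEM

# Two subsystems: improve the Adam injector MTBF and subsystem 3 together.
gain_sub3 = system(a1, a3_better) - A_SYSTEM
gain_two = system(availability(37.74, 5.15), a3_better) - A_SYSTEM

print(f"same subsystem: {gain_both:.4f} vs sum {gain_mtbf + gain_mttr:.4f}")
print(f"two subsystems: {gain_two:.4f} vs sum {gain_mtbf + gain_sub3:.4f}")
```

Within one subsystem the combined gain falls short of the sum of the separate gains, while across two subsystems the multiplicative form makes the combined gain slightly exceed the sum.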

3.4 An Optimization Example

Example:

Consider the Adam injector of Mode 1. Suppose that it is desired
to vary $\mu_1$ and $\nu_1$ simultaneously from their current values of
22.74 hrs. and 5.15 hrs. so that the overall increase in
$A_{\cdot\cdot}$ is 10%. Assume that all other parameters are held fixed.
Suppose further that it costs 2 units of money for 1 hr. of increase in
$\mu_1$ and 1 unit of money for 1 hr. of decrease in $\nu_1$. Also, from
the viewpoint of the state of the art, time, budget, etc., it is
considered impractical to boost the MTBF beyond 32.74 hrs. or to reduce
the MTTR below 2 hours. Does a solution exist for the desired 10%
increase in $A_{\cdot\cdot}$? If yes, what is it?

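Solution:

Let x be the number of hours we should increase MTBF. Let y be
the number of hours we should decrease MTTR. Then the Adam injector has
MTBF 22.74 + x and MTTR 5.15 - y, and requiring $A_{\cdot\cdot}$ to rise
from .698 to .798 with all other factors fixed gives, after
rearrangement, 0.073x + y = 3.49.

So the problem is to

    Minimize    2x + y

    subject to  0.073x + y = 3.49 ,
                10 ≥ x ≥ 0 ,
                3.15 ≥ y ≥ 0 ,

where 2x + y is the cost incurred.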

This is a linear programming problem, but since it will in general
become nonlinear when 2 or more subsystems are considered, or when the
costs are nonlinear, we will solve it in that setting. By the method of
pp. 233-234 of [18] (similar to the idea of Lagrange multipliers), we get

    y = 3.15 hrs.
    x = 4.66 hrs.

The cost = 2(4.66) + 1(3.15) = 12.47 units.
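The example can be reproduced with a short script. This is a sketch under stated assumptions, not the method of [18]: because the cost is linear and the constraint is a single line, the minimum must sit at an end point of the feasible segment, so the code simply compares the two end points. The names `cheapest_plan`, `slope`, and `rhs` are chosen here for illustration.

```python
def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

MTBF0, MTTR0 = 22.74, 5.15          # current Adam injector values (hrs.)
A_NOW, A_TARGET = 0.698, 0.798      # formula value now and after a 10% gain

# Availability the Adam injector must reach if all other factors stay fixed.
a1_target = A_TARGET * availability(MTBF0, MTTR0) / A_NOW
# (MTBF0 + x) / (MTBF0 + x + MTTR0 - y) = a1_target rearranges to
# slope * x + y = rhs.
slope = (1 - a1_target) / a1_target
rhs = MTTR0 - MTBF0 * slope

def cheapest_plan(slope, rhs, x_max=10.0, y_max=3.15, cost_x=2.0, cost_y=1.0):
    """Minimise cost_x*x + cost_y*y on the line slope*x + y = rhs with
    0 <= x <= x_max and 0 <= y <= y_max. Returns (x, y, cost) or None."""
    x_low = max(0.0, (rhs - y_max) / slope)
    x_high = min(x_max, rhs / slope)
    if x_low > x_high:
        return None                  # the target cannot be reached
    # A linear cost on a line segment is minimised at one of its end points.
    plans = [(x, rhs - slope * x) for x in (x_low, x_high)]
    x, y = min(plans, key=lambda p: cost_x * p[0] + cost_y * p[1])
    return x, y, cost_x * x + cost_y * y

# Use the rounded constraint 0.073x + y = 3.49 as printed in the text.
x, y, cost = cheapest_plan(0.073, 3.49)
print(f"x = {x:.2f} hrs., y = {y:.2f} hrs., cost = {cost:.2f} units")
```

The derived coefficients `slope` and `rhs` agree with the printed 0.073 and 3.49 to within rounding, and the end-point search returns the same plan as the text: reduce the MTTR as far as allowed and make up the remainder with MTBF.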

In general we will need a computer to solve realistic problems.
Standard algorithms for such nonlinear programming problems exist, as we
have indicated in the above formulation.
CHAPTER 4

SYSTEM RELIABILITY

We briefly justify the use of the exponential function in the
expression $R = A \cdot e^{-t/\mathrm{MTBF}}$, where R stands for
reliability, i.e., the probability that a given system or subsystem will
survive at least t hours of mission under stated operating conditions,
and A is the availability of the system or subsystem. We need to verify
that the up time distribution is exponential. For large and complicated
systems or subsystems such as those found in the LBL Superhilac there is
a strong theoretical basis for the exponential assumption to be true [4].
This is also indicated by the Total Time on Test Plots [see sample plots
in Appendix 4]. The use of these plots is justified by our Time Series
Analysis findings that all up times are uncorrelated [Appendix 3].

Previously we have seen that increasing MTBF or decreasing MTTR will
result in increasing availability. It is clear from the above formula
that the MTBF holds an extra edge over the MTTR in increasing system
reliability. This should be exploited wherever possible.

When solving an optimization problem of the type given in the last
chapter, we can add the additional constraint that
$R = A e^{-t/\mathrm{MTBF}}$ be $\ge$ a certain assigned probability for
a given t.
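The extra edge of the MTBF can be seen in a small calculation. The sketch below applies the expression for R to the Adam injector figures quoted earlier in this report; the 8-hour mission length is a hypothetical value chosen only for illustration, and the function names are not from the report.

```python
from math import exp

def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

def reliability(mtbf, mttr, t):
    """R = A * exp(-t / MTBF), with A = MTBF / (MTBF + MTTR)."""
    return availability(mtbf, mttr) * exp(-t / mtbf)

T = 8.0  # hypothetical mission length in hours

base = reliability(22.74, 5.15, T)
# Two changes that raise the availability by nearly the same amount:
longer_mtbf = reliability(22.74 + 15, 5.15, T)   # MTBF up by 15 hrs.
shorter_mttr = reliability(22.74, 5.15 - 2, T)   # MTTR down by 2 hrs.

print(f"{base:.3f} {shorter_mttr:.3f} {longer_mtbf:.3f}")
```

Although both changes give almost the same availability, the longer MTBF also slows the exponential decay, so it yields the larger reliability.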
APPENDIX 1


The exact shutting-off relationships between subsystems of the
Superhilac are not known. The relationships also change with time as
more monitoring and controlling computers are wired to various parts
of the accelerator. Hence, the following chart represents only a partial
description of what really happens.


Subsystem      Other Subsystems Shut Off

    1          None
    2          None
    3          None
    4          None
    5          None
    6          None
    7          None
    8          None
    9          4, 10, 11, and 12
   10          4
   11          4
   12          4, 9, and 11
   13          None
   14          1-13


The up times of subsystems are calculated with the above chart in
mind so as to estimate the subsystem parameters as accurately as possible.
For example:


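[FIGURE 2: Up time and down time histories of the Mode 1 system (UP, DOWN, UP)
and of a subsystem. Labels in the figure: system down time due to subsystem 9's
failure; maintenance time out; subsystem up time pieces of 3 hrs., 2 hrs., and
1 hr.; 4 DOWN.]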


In the above Up Time and Down Time Histories, the 3 pieces of up
time labeled 3 hrs., 2 hrs., and 1 hr. of subsystem 4 are lumped together
as a single up time of 6 hrs. in the calculation of the MTBF, since the
interruptions between them were not failures of subsystem 4 itself.
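
The lumping rule can be sketched in a few lines. The record format used here, a list of (hours up, ended by own failure) pairs, is a hypothetical one chosen for illustration; the report does not describe how the logs were actually stored.

```python
def lumped_up_times(pieces):
    """Merge consecutive up-time pieces until the subsystem itself fails.

    `pieces` is a list of (hours_up, ended_by_own_failure) tuples. A piece
    that ends because another subsystem failed, or because of maintenance,
    is carried over and added to the next piece.
    """
    up_times = []
    running = 0.0
    for hours, ended_by_own_failure in pieces:
        running += hours
        if ended_by_own_failure:
            up_times.append(running)
            running = 0.0
    # A trailing piece with no own failure yet is incomplete and is left out.
    return up_times

# The example of Figure 2: 3 hrs. (stopped by subsystem 9's failure),
# 2 hrs. (stopped by maintenance), 1 hr. (ended by the subsystem's own failure).
history = [(3.0, False), (2.0, False), (1.0, True)]
ups = lumped_up_times(history)
print(ups)                   # one lumped up time of 6 hrs.
print(sum(ups) / len(ups))   # MTBF estimate from this short history
```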
APPENDIX  3 


The  main  results  from  the  Time  Series  Analysis  are: 


1.  (i)    No significant autocorrelations are found in up times
           and down times in any of the three systems (Mode 1, 2,
           and 3).

    (ii)   No significant autocorrelations are found in up times
           and down times in any subsystems of Mode 2. (Only
           2, 3, 4, 6, 7, 8, 9, and 13 are analyzed; the others
           do not have enough data points to provide a meaningful
           analysis.)

    (iii)  No significant autocorrelations are found in up times
           of any subsystems of Mode 1. (Only 1, 3, 4, 6, 7, 8,
           9, and 13 are analyzed.)

    (iv)   No significant autocorrelations are found in the down
           times of any subsystems of Mode 1 except subsystem 13
           (experimenter), the serial correlation of which is
           given by $Z_t = 0.22 Z_{t-1} + 0.78 Z_{t-2} + a_t$.


The above results imply that the use of Total Time on Test Plots
to analyze the distributions of up times and down times is practically
justified in all cases but one. They also mean, in particular, that
any two consecutive up times are well shielded from each other by the
intervening repair work and possibly also by regular maintenance work.
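
A check of this kind can be sketched as follows. The report does not state which significance test was used, so the approximate 95% bound of 1.96 / sqrt(n) below is an assumption, as are the function names; the sequence in the example is artificial and is not Superhilac data.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i + lag] for i in range(n - lag))
    return num / denom

def significant_lags(series, max_lag):
    """Lags whose autocorrelation exceeds the approximate 95% bound.

    The bound 1.96 / sqrt(n) is the usual large-sample value for an
    uncorrelated sequence (an assumed criterion, not the report's).
    """
    bound = 1.96 / math.sqrt(len(series))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(series, k)) > bound]

# Artificial example: a strictly alternating sequence is strongly correlated.
alternating = [1.0, 2.0] * 10
print(autocorrelation(alternating, 1))
print(significant_lags(alternating, 2))
```

An empty list from `significant_lags` would correspond to the "no significant autocorrelations" finding reported above.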
2.  No cross correlations are found between any pair of up time
and down time sequences, i.e., for all three modes and for all
subsystems of both Mode 1 and Mode 2.

This has, in particular, partially verified the requirement
that all random variables in our models be independent of each
other.

This also confirms that each repair is fairly complete so that
one cannot link in any manner the immediately following up
time to it [23].

3.  The  panoramic  graphs  of  the  up  time  and  down  time  sequences 
of  Mode  1,  2,  and  3 for  the  entire  period  of  26  months  are 
attached  here  for  easy  comparison. 

The time series analysis was made using the computer codes described
in [24].
APPENDIX  4 

RESULTS  FROM  TOTAL  TIME  ON  TEST  PLOTS 

1.  The  following  Total  Time  on  Test  Plots  from  ups  and  downs  of 
Mode  1,  2,  and  3;  of  subsystems  1,  3,  and  8 of  Mode  1;  of 
subsystems  2,  3,  and  8 of  Mode  2 are  typical. 

2.  All  up  time  graphs  are  closely  approximated  by  either  (a)  or 
(b). 


[Sketches (a) and (b): the two typical shapes of the up time Total Time on Test Plots]

(i)  In case (a), the up time distribution is approximately
exponential. This justifies the use of $R = A \cdot e^{-t/\mathrm{MTBF}}$.
In particular, notice that the up times of all 3 modes
have a Total Time on Test Plot like (a).

(ii)  In case (b), the up time distribution is either DFR
(decreasing failure rate) or (more likely) a mixture
of some distributions, possibly exponentials. From people
familiar with the system it is known that in some 50%
of the repair cases, better designs are used to replace
the failed pieces of equipment. Hence the up times
following these repairs constitute a mixed population.
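
For readers who wish to reproduce such plots, the points of a scaled Total Time on Test Plot can be computed as in the minimal sketch below. It uses the usual definition for a complete (uncensored) sample; the function name and the small data set are illustrative and are not taken from the Superhilac records.

```python
def scaled_ttt(times):
    """Return the points (i/n, T_i/T_n) of the scaled Total Time on Test Plot.

    With the ordered sample x_(1) <= ... <= x_(n) and x_(0) = 0,
    T_i = sum over j = 1..i of (n - j + 1) * (x_(j) - x_(j-1)).
    """
    ordered = sorted(times)
    n = len(ordered)
    totals = []
    running = 0.0
    previous = 0.0
    for i, x in enumerate(ordered, start=1):
        running += (n - i + 1) * (x - previous)   # time on test added in this step
        previous = x
        totals.append(running)
    grand_total = totals[-1]                      # equals the sum of all times
    return [(i / n, t / grand_total) for i, t in enumerate(totals, start=1)]

# Illustrative up times in hours (not real data).
for u, v in scaled_ttt([1.0, 2.0, 3.0, 4.0]):
    print(f"{u:.2f}  {v:.2f}")
```

Points that stay close to the diagonal suggest an exponential distribution, as in case (a); points that sag below the diagonal suggest DFR behavior, as in case (b).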
3.  All down time graphs are more or less like the following sketch.

[Sketch: typical shape of the down time Total Time on Test Plots]

(i)  Roughly, these can be identified as lognormal.

(ii)  The initial rather long flat portions of the graph are
due to rounding off to 0.5 hrs. and 1.0 hrs. all
points in their vicinity. Hence the repair distributions
may actually be DFR rather than lognormal. The rounding is due
to the limitations of the current data-recording system
and will be remedied in the future when the data collection
is computerized.


